Abstract

The alternating direction method of multipliers (ADMM) has been recognized as a versatile approach 
for solving modern large-scale machine learning and signal processing problems efficiently. When the data 
size and/or the problem dimension is large, a distributed version of ADMM can be used, which is capable 
of distributing the computation load and the data set to a network of computing nodes. Unfortunately, 
a direct synchronous implementation of such an algorithm does not scale well with the problem size, as
the algorithm speed is limited by the slowest computing nodes. To address this issue, in a companion 
paper, we have proposed an asynchronous distributed ADMM (AD-ADMM) and studied its worst-case 
convergence conditions. In this paper, we further the study by characterizing the conditions under which 
the AD-ADMM achieves linear convergence. Our conditions as well as the resulting linear rates reveal 
the impact that various algorithm parameters, network delay and network size have on the algorithm 
performance. To demonstrate the superior time efficiency of the proposed AD-ADMM, we test the
AD-ADMM on a high-performance computer cluster by solving a large-scale logistic regression problem.
I. Introduction

Consider the following optimization problem

$$\min_{x \in \mathbb{R}^n} \; \sum_{i=1}^{N} f_i(x) + h(x), \tag{1}$$

where each $f_i : \mathbb{R}^n \to \mathbb{R}$ is the cost function and $h : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is a non-smooth, convex
regularization function. The regularization function is used for obtaining structured solutions (e.g., sparsity)
and/or is an indicator function which enforces $x$ to lie in a constraint set [2, Section 5]. Many
important statistical learning problems can be formulated as problem (1), including, for example, the 
LASSO problem [3], logistic regression (LR) problem [4], support vector machine (SVM) [5] and the 
sparse principal component analysis (PCA) problem [6], to name a few. 

Distributed optimization algorithms that can scale well with large-scale instances of (1) have drawn 
significant attention in recent years [2], [7]-[14]. Our interest in this paper lies in the distributed optimization
method based on the alternating direction method of multipliers (ADMM) [2, Section 7.1.1].
The ADMM is a convenient approach of distributing the computation load of a very large-scale problem 
to a network of computing nodes. Specifically, consider a computer network with a star topology, where 
one master node coordinates the computation of a set of N distributed workers. Based on a consensus 
formulation, the distributed ADMM partitions the original problem into N subproblems, each of which 
contains either a small set of training samples or a subset of the learning parameters. At each iteration, 
the distributed workers solve the subproblems based on the local data and send the variable information 
to the master, who summarizes the variable information and broadcasts it back to the workers. Through
such iterative variable update and information exchange, the large-scale learning problem can be solved 
in a distributed and parallel manner. 

The convergence conditions of the distributed ADMM have been extensively studied; see [2], [7], 
[15]-[20]. For example, for general convex problems, references [2], [7] showed that the ADMM is 
guaranteed to converge to an optimal solution and [15] showed that the ADMM has a worst-case $O(1/k)$
convergence rate, where $k$ is the iteration number. Considering non-convex problems with smooth $f_i$'s,
reference [16] presented conditions for which the distributed ADMM converges to the set of Karush-Kuhn-Tucker
(KKT) points. For problems with strongly convex and smooth $f_i$'s or problems satisfying certain
error bound condition, references [17] and [21] respectively showed that the ADMM can even exhibit a 
linear convergence rate. References [18]-[20] also showed similar linear convergence conditions for some 
variants of distributed ADMM in a network with a general topology. However, the distributed ADMM in 
[2], [16] have assumed a synchronous network, where at each iteration, the master always waits until all 
the workers report their variable information. Unfortunately, such a synchronous protocol does not scale
well with the problem size, as the algorithm speed is determined by the “slowest” workers. To improve
the time efficiency, the works [22], [23] have generalized the distributed ADMM to an asynchronous 
network. Specifically, in the asynchronous distributed ADMM (AD-ADMM) proposed in [22], [23], the 
master does not necessarily wait for all the workers. Instead, the master updates its variable whenever it 
receives the variable information from a partial set of the workers. This prevents the master and speedy 
workers from spending most of the time waiting and consequently can improve the time efficiency of 
distributed optimization. Theoretically, it has been shown in [23] that the AD-ADMM is guaranteed to
converge (to a KKT point) even for non-convex problem (1), under a bounded delay assumption only. 

The contributions of this paper are twofold. Firstly, beyond the convergence analysis in [23], we further 
present the conditions for which the AD-ADMM can exhibit a linear convergence rate. Specifically, we 
show that for problem (1) with some structured convex $f_i$'s (e.g., strongly convex), the augmented
Lagrangian function of the AD-ADMM can decrease by a constant fraction in every iteration of the 
algorithm, as long as the algorithm parameters are chosen appropriately according to the network delay. 
We give explicit expressions on the linear convergence conditions and the linear rate, which illustrate 
how the algorithm and network parameters impact on the algorithm performance. To the best of our 
knowledge, our results are novel, and are by no means extensions of the existing analyses [17]-[21] for 
synchronous ADMM. Secondly, we present extensive numerical results to demonstrate the time efficiency 
of the AD-ADMM over its synchronous counterpart. In particular, we consider a large-scale LR problem 
and implement the AD-ADMM on a high-performance computer cluster. The presented numerical results 
show that the AD-ADMM significantly reduces the practical running time of distributed optimization. 

Synopsis: Section II reviews the AD-ADMM in [23]. The linear convergence analysis is presented 
in Section III and the proofs are presented in Section IV. Numerical results are given in Section V and 
conclusions are drawn in Section VI. 

II. Asynchronous Distributed ADMM 

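In this section, we review the AD-ADMM proposed in [23]. The distributed ADMM [2, Section 7.1.1]
is derived based on the following consensus formulation of (1):

$$\min_{x_0, x_1, \ldots, x_N \in \mathbb{R}^n} \; \sum_{i=1}^{N} f_i(x_i) + h(x_0) \tag{2a}$$

$$\text{s.t.} \;\; x_i = x_0 \;\; \forall i \in \mathcal{V} \triangleq \{1, \ldots, N\}. \tag{2b}$$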
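By applying the standard ADMM [7] to problem (2), one obtains the following three simple steps: for
iteration $k = 0, 1, \ldots$, update

$$x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \Big\{ h(x_0) - x_0^T \sum_{i=1}^{N} \lambda_i^k + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^k - x_0\|^2 \Big\}, \tag{3}$$

$$x_i^{k+1} = \arg\min_{x_i \in \mathbb{R}^n} \Big\{ f_i(x_i) + x_i^T \lambda_i^k + \frac{\rho}{2} \|x_i - x_0^{k+1}\|^2 \Big\} \;\; \forall i \in \mathcal{V}, \tag{4}$$

$$\lambda_i^{k+1} = \lambda_i^k + \rho\,(x_i^{k+1} - x_0^{k+1}) \;\; \forall i \in \mathcal{V}. \tag{5}$$
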

As seen, the distributed ADMM is designed for a computing network with a star topology that consists
of one master node and a set of $N$ workers (see Fig. 1 in [23]). In particular, the master is responsible
for optimizing the variable $x_0$ by (3), while each worker $i$, $i \in \mathcal{V}$, takes charge of optimizing variables
$x_i$ and $\lambda_i$ by (4) and (5), respectively. Once the master updates $x_0$, it broadcasts $x_0$ to the workers;
each worker $i$ then updates $(x_i, \lambda_i)$ based on the received $x_0$, and sends the new $(x_i, \lambda_i)$ to the master.
Through such iterative variable update and message exchange, problem (2) is solved in a fully parallel
and distributed fashion.
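
The master/worker exchange described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes, purely for the example, least-squares costs $f_i(x) = \frac{1}{2}\|A_i x - b_i\|^2$ and $h$ equal to the indicator function of a box, so that every update has a closed form; the name `sync_consensus_admm` and its parameters are hypothetical.

```python
import numpy as np

def sync_consensus_admm(A_list, b_list, rho=1.0, box=10.0, iters=200):
    """Synchronous consensus ADMM sketch (illustrative assumptions only).

    Assumes f_i(x) = 0.5 * ||A_i x - b_i||^2 and h = indicator of [-box, box]^n.
    """
    N = len(A_list)
    n = A_list[0].shape[1]
    x = np.zeros((N, n))      # worker variables x_i
    lam = np.zeros((N, n))    # dual variables lambda_i
    x0 = np.zeros(n)          # master variable x_0
    for _ in range(iters):
        # Master step: minimize h(x0) - x0^T sum(lam) + rho/2 * sum ||x_i - x0||^2.
        # For a box indicator this is the average of x_i + lam_i/rho, clipped to the box.
        x0 = np.clip((x + lam / rho).mean(axis=0), -box, box)
        for i, (A, b) in enumerate(zip(A_list, b_list)):
            # Worker step: minimize f_i(x) + x^T lam_i + rho/2 * ||x - x0||^2,
            # which for least squares is a linear system.
            x[i] = np.linalg.solve(A.T @ A + rho * np.eye(n),
                                   A.T @ b - lam[i] + rho * x0)
            # Dual step: lam_i <- lam_i + rho * (x_i - x0).
            lam[i] += rho * (x[i] - x0)
    return x0, x, lam
```

In this sketch the inner loop over workers runs sequentially; on a real cluster each worker would run its two updates in parallel, and the master would block until all $N$ results are in, which is the synchronization cost discussed next.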

However, to implement (3)-(5), the master and the workers have to be synchronized with each other. 
Specifically, according to (3), the master proceeds to update $x_0$ only if it has received the up-to-date
$(x_i, \lambda_i)$ from all the workers. This implies that the optimization speed would be determined by the
slowest worker in the network. This is in particular the case in a heterogeneous network where the 
workers experience different computation and communication delays, in which case the master and speedy 
workers would idle most of the time. 

The distributed ADMM has been extended to an asynchronous network in [22], [23]. In the AD-ADMM,
the master does not wait for all the workers, but instead updates the variable $x_0$ as long as it receives
variable information from a partial set of workers. This would greatly reduce the waiting time
of the master, and improve the overall time efficiency of distributed optimization. The AD-ADMM is
presented in Algorithm 1, which includes the algorithmic steps of the master and those of the workers.
Here, we denote $k$ as the iteration number of the master (i.e., the number of times for which the master
updates $x_0$), and assume that, at each iteration $k$, the master receives variable information from workers
in the set $\mathcal{A}_k \subseteq \mathcal{V} \triangleq \{1, \ldots, N\}$. Worker $i$ is said to be “arrived” at iteration $k$ if $i \in \mathcal{A}_k$ and unarrived
otherwise. Notation $\mathcal{A}_k^c$ denotes the complementary set of $\mathcal{A}_k$, i.e., $\mathcal{A}_k \cap \mathcal{A}_k^c = \emptyset$ and $\mathcal{A}_k \cup \mathcal{A}_k^c = \mathcal{V}$.
Moreover, variables $d_i$'s are used to count the numbers of delayed iterations of the workers. The variables
$\rho$ and $\gamma$ are two penalty parameters.
In the AD-ADMM, the master inevitably uses delayed and old variable information for updating $x_0$.
As shown in step 4 of Algorithm of the Master, to ensure that the variable information used is not too stale, the
master would wait until it receives the up-to-date $(x_i, \lambda_i)$ from all the workers that have $d_i \geq \tau - 1$,
if any (so all the workers $i \in \mathcal{A}_k^c$ must have $d_i < \tau - 1$). This condition guarantees that the variable
information is at most $\tau$ iterations old, and is known as the partially asynchronous model [7]:

Assumption 1 (Bounded delay) Let $\tau \geq 1$ be a maximum tolerable delay. For all $i \in \mathcal{V}$ and iteration
$k$, it must be that $i \in \mathcal{A}_k \cup \mathcal{A}_{k-1} \cup \cdots \cup \mathcal{A}_{k-\tau+1}$.
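
To make the bounded delay condition concrete, the sketch below checks whether a recorded sequence of arrival sets satisfies it. The function name `satisfies_bounded_delay` is hypothetical, and the treatment of the first $\tau - 1$ iterations (they are skipped, as if every worker had reported just before iteration 0) is an assumption of this example rather than something stated in the text.

```python
def satisfies_bounded_delay(arrival_sets, num_workers, tau):
    """Check that every worker appears in each window of tau consecutive arrival sets.

    arrival_sets[k] is the set A_k of worker ids (1..num_workers) that arrived
    at master iteration k. Windows that would reach before iteration 0 are
    skipped (an assumption of this sketch).
    """
    everyone = set(range(1, num_workers + 1))
    for k in range(tau - 1, len(arrival_sets)):
        # Union A_k, A_{k-1}, ..., A_{k-tau+1}
        seen = set().union(*arrival_sets[k - tau + 1 : k + 1])
        if seen != everyone:
            return False
    return True
```

With `tau = 1` the check requires every arrival set to contain all workers, which is the synchronous case.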

In [23, Theorem 1], we have shown that under Assumption 1, some smoothness conditions on the cost
functions $f_i$'s (see [23, Assumption 2]) and for sufficiently large $\rho$ and $\gamma$, the AD-ADMM in Algorithm
1 is provably convergent to the set of KKT points of problem (2). Notably, this convergence property
holds even for non-convex $f_i$'s. In the next section, we focus on convex $f_i$'s, and further characterize the
linear convergence conditions of the AD-ADMM.
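
The asynchronous scheme can be simulated on a single machine to get a feel for its behavior. The sketch below is not Algorithm 1 itself. It assumes least-squares costs, $h = 0$, and a hypothetical arrival model in which each worker reports with probability `p_arrive` per master iteration and is forced to report once its delay counter reaches $\tau - 1$. Each arrived worker updates $(x_i, \lambda_i)$ using the copy of $x_0$ it last received, and the master then minimizes its subproblem with the extra proximal term $\frac{\gamma}{2}\|x_0 - x_0^k\|^2$. All function and parameter names are illustrative.

```python
import numpy as np

def ad_admm_least_squares(A_list, b_list, rho=10.0, gamma=0.0, tau=3,
                          p_arrive=0.5, iters=500, seed=0):
    """Single-process simulation of an asynchronous consensus ADMM (sketch).

    Assumes f_i(x) = 0.5 * ||A_i x - b_i||^2 and h = 0. The random arrival
    model is an assumption of this example, not part of the paper.
    """
    rng = np.random.default_rng(seed)
    N, n = len(A_list), A_list[0].shape[1]
    x0 = np.zeros(n)
    x = np.zeros((N, n))
    lam = np.zeros((N, n))
    x0_seen = np.zeros((N, n))       # copy of x0 each worker last received
    d = np.zeros(N, dtype=int)       # delay counters d_i
    history = []                     # arrival sets A_k (0-based worker ids)
    for _ in range(iters):
        # Workers whose delay reached tau - 1 must arrive; others arrive at random.
        arrived = (d >= tau - 1) | (rng.random(N) < p_arrive)
        if not arrived.any():
            arrived[rng.integers(N)] = True
        for i in np.flatnonzero(arrived):
            A, b = A_list[i], b_list[i]
            # Worker update with the (possibly stale) copy of x0 it holds.
            x[i] = np.linalg.solve(A.T @ A + rho * np.eye(n),
                                   A.T @ b - lam[i] + rho * x0_seen[i])
            lam[i] += rho * (x[i] - x0_seen[i])
        d[arrived] = 0
        d[~arrived] += 1
        # Master update with proximal weight gamma (closed form since h = 0).
        x0 = (rho * x.sum(axis=0) + lam.sum(axis=0) + gamma * x0) / (N * rho + gamma)
        x0_seen[arrived] = x0        # only arrived workers receive the new x0
        history.append(set(np.flatnonzero(arrived).tolist()))
    return x0, x, lam, history
```

Setting `tau=1` forces every worker to arrive at every iteration, so the simulation reduces to the synchronous method. For `tau > 1` no claim is made here that arbitrary choices of `rho` and `gamma` converge; that is exactly what the conditions in the following sections address.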

III. Linear Convergence Rate Analysis 

In this section, we show that the AD-ADMM can achieve linear convergence for some structured convex 
functions. We first make the following convex assumption on problem (1) (or equivalently, problem (2)). 

Assumption 2 Each function $f_i$ is a proper closed convex function and is continuously differentiable;
each gradient $\nabla f_i$ is Lipschitz continuous with a Lipschitz constant $L > 0$; the function $h$ is convex
(not necessarily smooth). Moreover, problem (1) is bounded below, i.e., $F^* > -\infty$ where $F^*$ denotes the
optimal objective value of problem (1).

Assumption 2 is the same as [23, Assumption 2], except that the $f_i$'s are assumed convex here. Given this
convex property, it is well known that the augmented Lagrangian function, i.e.,

$$\mathcal{L}_\rho(x^k, x_0^k, \lambda^k) = \sum_{i=1}^{N} f_i(x_i^k) + h(x_0^k) + \sum_{i=1}^{N} (\lambda_i^k)^T (x_i^k - x_0^k) + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^k - x_0^k\|^2,$$

would converge to $F^*$ whenever the iterates $(x^k, x_0^k, \lambda^k)$ approach the optimal solution of
problem (2). Therefore, our analysis is based on characterizing how $\mathcal{L}_\rho(x^k, x_0^k, \lambda^k)$ can converge to $F^*$
linearly. Let us define

$$\Delta_k \triangleq \mathcal{L}_\rho(x^k, x_0^k, \lambda^k) - F^*. \tag{13}$$

It has been shown in [23, Lemma 3] that $\Delta_k \geq 0$ for all $k$ as long as $\rho > L$.
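
For readers who want to track this quantity numerically, the following sketch evaluates the augmented Lagrangian for given iterates; subtracting a known or estimated $F^*$ then gives the gap $\Delta_k$. The function name and the convention of passing the costs as Python callables are assumptions of this example.

```python
import numpy as np

def augmented_lagrangian(f_list, h, x, x0, lam, rho):
    """Evaluate sum_i f_i(x_i) + h(x0) + sum_i lam_i^T (x_i - x0)
    + rho/2 * sum_i ||x_i - x0||^2.

    f_list: list of callables f_i; h: callable; x, lam: sequences of vectors.
    """
    total = h(x0)
    for f_i, x_i, lam_i in zip(f_list, x, lam):
        r = x_i - x0                      # consensus residual of worker i
        total += f_i(x_i) + lam_i @ r + 0.5 * rho * (r @ r)
    return total
```

At a consensus point ($x_i = x_0$ for all $i$) the last two sums vanish and the value reduces to the objective of problem (1) at $x_0$.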

In the ensuing analysis, we consider two types of structured convex cost functions, respectively 
described in the following two assumptions. 

Assumption 3 For all $i \in \mathcal{V}$, each function $f_i$ is strongly convex with modulus $\sigma^2 > 0$.

Assumption 4 Each function $f_i(x) = g_i(A_i x)$, $\forall i \in \mathcal{V}$, where $g_i : \mathbb{R}^m \to \mathbb{R}$ is a strongly convex
function with modulus $\sigma^2 > 0$ and $A_i \in \mathbb{R}^{m \times n}$ is a nonzero matrix with arbitrary rank. Moreover,

$h(x) = 0$.

Note that in Assumption 4 matrix $A_i$ can have an arbitrary rank, so $f_i(x)$ is not necessarily strongly
convex with respect to $x$. Interestingly, such structured cost functions appear in many machine learning
problems, for example, the least squares problem and the logistic regression problem [5].
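
A small numerical illustration of this point, under the assumption of a least-squares cost $f(x) = g(Ax)$ with $g(y) = \frac{1}{2}\|y - b\|^2$: the Hessian of $g$ is the identity (strongly convex with modulus 1), while the Hessian of $f$ is $A^T A$, which is singular whenever $A$ has fewer rows than columns. The helper name `curvature_gap` is hypothetical.

```python
import numpy as np

def curvature_gap(A):
    """Return the smallest Hessian eigenvalue of g(y) = 0.5*||y - b||^2 and of
    f(x) = g(A x). The Hessians are I and A^T A, independent of b."""
    hess_g = np.eye(A.shape[0])
    hess_f = A.T @ hess_g @ A
    return np.linalg.eigvalsh(hess_g).min(), np.linalg.eigvalsh(hess_f).min()

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # 3 x 5, so rank(A) <= 3 < 5
min_eig_g, min_eig_f = curvature_gap(A)
# min_eig_g is 1 (g is strongly convex); min_eig_f is numerically zero,
# so f is convex but not strongly convex in x.
```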

Let us first consider the strongly convex case. Under Assumption 3, the linear convergence conditions 
of the AD-ADMM are given by the following theorem. 


Theorem 1 Suppose that Assumptions 1, 2 and 3 hold true. Moreover, assume that there exists a constant
$S \in [1, N]$ such that $|\mathcal{A}_k| \leq S$ for all $k$ and that


p > max 


(1 + L^) + v'(l + A2)2 + 8L2a(r) 


Np 




7 > max <j /3{p, r) - ^ + 1, 8N{p - cr^) 


(14) 

(15) 


where a{T) = 1 + and I3{p,t) = 2(r — i)[( )^+^ )S+S/N i _ i _ ppgn, the 


l+SNa‘^ 

iterates generated by ( 6 ), (7) and (9) satisfy 


0 < Afc+i < 


1 ^ 


fe+i 


Aq, 


where $\delta$ is a constant satisfying


5>max<’ 1 ,—^-U. 


(16) 


(17) 


Theorem 1 asserts that, for problem (2) with strongly convex $f_i$'s, the gap $\Delta_k$ between the augmented
Lagrangian function and $F^*$ can decrease linearly to zero, as long as $\rho$ and $\gamma$ are large enough
(exponentially increasing with $\tau$). Equation (16) also implies that the linear rate would decrease with the
delay $\tau$ and the number of workers in the worst case.
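
In experiments, a linear rate shows up as a straight line when the gap is plotted on a logarithmic scale. The sketch below estimates the per-iteration contraction factor from a recorded sequence of positive gaps by a least-squares line fit to their logarithms. It is a diagnostic under the assumption that the gaps decay roughly geometrically, and it does not compute the theoretical rate of Theorem 1; the function name is hypothetical.

```python
import numpy as np

def estimate_linear_rate(gaps):
    """Fit log(gap_k) ~ intercept + k * slope and return exp(slope),
    i.e. the empirical factor by which the gap shrinks per iteration."""
    gaps = np.asarray(gaps, dtype=float)
    k = np.arange(len(gaps))
    slope, _intercept = np.polyfit(k, np.log(gaps), 1)
    return float(np.exp(slope))
```

A returned value well below 1 suggests fast linear convergence; a value close to 1 suggests slow or sublinear behavior over the recorded window.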

Analogous to Theorem 1, the following theorem shows that the AD-ADMM can achieve linear 
convergence under Assumption 4. 
Theorem 2 Suppose that Assumptions 1, 2 and 4 hold true. Moreover, assume that there exists a constant
$S \in [1, N]$ such that $|\mathcal{A}_k| \leq S$ for all $k$ and that



p > max 



for some constant $c > 0$. Then, the iterates generated by (6), (7) and (9) satisfy (16) with $\delta$ satisfying



Since it has been known that the (synchronous) distributed ADMM [17]-[21] can converge linearly 
given the same structured cost functions in Assumption 3 and Assumption 4, the convergence results 
presented above demonstrate that the linear convergence property can be preserved in the asynchronous 
network. We remark that (14) and (15) are sufficient conditions only. In practice, the AD-ADMM could 
still exhibit a linear convergence rate without exactly satisfying these conditions. 

The proofs of Theorem 1 and Theorem 2 are presented in the next section. The readers who are more 
interested in the numerical performance of the AD-ADMM may jump to Section V. 


IV. Proofs of Theorems 


A. Preliminaries and Key Lemmas 

Let us present some basic inequalities that will be used frequently in the ensuing analysis and key 
lemmas for proving Theorem 1 and Theorem 2. 

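We will frequently use the following inequality due to Jensen's inequality: for any $a_i$, $i = 1, \ldots, M$,

$$\Big\| \sum_{i=1}^{M} a_i \Big\|^2 \leq M \sum_{i=1}^{M} \|a_i\|^2. \tag{18}$$

Moreover, for any $a$, $b$ and $\delta > 0$,

$$\|a + b\|^2 \leq (1 + \delta)\|a\|^2 + \Big(1 + \frac{1}{\delta}\Big)\|b\|^2. \tag{19}$$
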

The equality is also known to be true: for any vectors a, b, c and d, 



2 


( 20 ) 
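We follow [23, Algorithm 3] to write Algorithm 1 from the master's point of view as follows:

$$x_i^{k+1} = \begin{cases} \arg\min_{x_i} \big\{ f_i(x_i) + x_i^T \lambda_i^k + \frac{\rho}{2} \|x_i - x_0^{\bar{k}_i+1}\|^2 \big\} & \forall i \in \mathcal{A}_k, \\ x_i^k & \forall i \in \mathcal{A}_k^c, \end{cases} \tag{21}$$

$$\lambda_i^{k+1} = \begin{cases} \lambda_i^k + \rho\,(x_i^{k+1} - x_0^{\bar{k}_i+1}) & \forall i \in \mathcal{A}_k, \\ \lambda_i^k & \forall i \in \mathcal{A}_k^c, \end{cases} \tag{22}$$

$$x_0^{k+1} = \arg\min_{x_0} \Big\{ h(x_0) - x_0^T \sum_{i=1}^{N} \lambda_i^{k+1} + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2} \|x_0 - x_0^k\|^2 \Big\}. \tag{23}$$

Here, index $\bar{k}_i$ in (21) and (22) represents the last iteration number before iteration $k$ for which worker
$i \in \mathcal{A}_k$ is arrived, i.e., $i \in \mathcal{A}_{\bar{k}_i}$. Under Assumption 1, it must hold

$$k - \tau \leq \bar{k}_i < k \;\; \forall k. \tag{24}$$

Furthermore, for workers $i \in \mathcal{A}_k^c$, let us denote $\tilde{k}_i$ as the last iteration number before iteration $k$ for
which worker $i$ is arrived, i.e., $i \in \mathcal{A}_{\tilde{k}_i}$. Then, under Assumption 1, it must hold

$$k - \tau < \tilde{k}_i < k \;\; \forall k. \tag{25}$$

In addition, denote $\hat{k}_i$ ($\tilde{k}_i - \tau \leq \hat{k}_i < \tilde{k}_i$) as the last iteration number before iteration $\tilde{k}_i$ for which worker
$i \in \mathcal{A}_k^c$ is arrived, i.e., $i \in \mathcal{A}_{\hat{k}_i}$. Then by (21) and (22), for all workers $i \in \mathcal{A}_k^c$, we must have

$$x_i^{\tilde{k}_i+1} = x_i^{\tilde{k}_i+2} = \cdots = x_i^{k} = x_i^{k+1}, \tag{26}$$

$$\lambda_i^{\tilde{k}_i+1} = \lambda_i^{\tilde{k}_i+2} = \cdots = \lambda_i^{k} = \lambda_i^{k+1}. \tag{27}$$

Since $i \in \mathcal{A}_{\tilde{k}_i}$ for all $i \in \mathcal{A}_k^c$ and by (26)-(27), we can equivalently write (21) and (22) for all $i \in \mathcal{A}_k^c$ as

$$x_i^{k+1} = x_i^{\tilde{k}_i+1} = \arg\min_{x_i} \big\{ f_i(x_i) + x_i^T \lambda_i^{\tilde{k}_i} + \frac{\rho}{2} \|x_i - x_0^{\hat{k}_i+1}\|^2 \big\}, \tag{28}$$

$$\lambda_i^{k+1} = \lambda_i^{\tilde{k}_i+1} = \lambda_i^{\tilde{k}_i} + \rho\,(x_i^{\tilde{k}_i+1} - x_0^{\hat{k}_i+1}) = \lambda_i^{\tilde{k}_i} + \rho\,(x_i^{k+1} - x_0^{\hat{k}_i+1}). \tag{29}$$

Based on these notations, we have shown in [23, Eqn. (33)] that the following lemma is true.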
Lemma 1 Suppose that Assumption 2 holds and $\rho > L$. Then, for all $k = 0, 1, \ldots$,


0 <Afc+i < Afc + 


1 + p/e 


E 

i^Ak 


l®o -®o 


fci+l||2 


27 + Ii_fc+1 _fc||2 I (L'^ + {^-^)p I 

»" +l 2 +7 


\Xq - iCn 


E 

i&Ak 


|x*+‘ - , 


(30) 


where $\epsilon \in (0, 1)$ is a constant.


In particular, (30) is the same as [23, Eqn. (33)] except that here we have assumed convex $f_i$'s. Lemma
1 shows how the gap between the augmented Lagrangian function $\mathcal{L}_\rho(x^{k+1}, x_0^{k+1}, \lambda^{k+1})$ and the optimal
objective value $F^*$ evolves with the iteration number $k$. Notice that it follows from [23, Lemma 3] that
$\Delta_{k+1} \geq 0$ for all $k$ if $\rho > L$. As will be seen shortly, Lemma 1 is crucial in the linear convergence
analysis.

Similar to [23, Lemma 3], we next need to bound the error terms, e.g., ll®o “ 

in (30), which is caused by asynchrony of the network. Here, we present a more general result for the 
latter analysis. 


Lemma 2 Let p > 0 and j — u < ji < j where v G ^++, f G and j G {0,1,... , k}. Moreover, 
let Mj CV be any index subset satisfying |A/^ | < N for some constant N G (1, A^]- Then, the following 
inequality holds true 


E 

j=0 isAfj 


k-1 


IA _ 

1-^0 *^0 II 


,>7277771 ,_i+i||2 


< (k - ]\\n- xi 


j=0 


(31) 


Proof: See Appendix B. ■ 

Now let us consider Assumption 3. For strongly convex $f_i$'s, it is known that the following first-order
condition holds [24]: $\forall x, y$,

$$f_i(y) \geq f_i(x) + (\nabla f_i(x))^T (y - x) + \frac{\sigma^2}{2} \|y - x\|^2. \tag{32}$$

Based on this property, we can bound $\Delta_{k+1}$ as follows.


Lemma 3 Suppose that Assumptions 2 and 3 hold and $\rho > \sigma^2$. If $\gamma > 8N(\rho - \sigma^2)$ and $\delta$ satisfies (17),
then it holds that


1 

- 4p2^ E 
L 2

+ 4p2^ 2^ 

ieAi 


^^i + 1 _ -I_\ 

II + 2jy 2^ 

iG^k 


„fc ^ki+l\\2 




2N 


A _ fc>+l||2 

Lq J,g 


lain - ain' ' ^ ir + ||ain - aig 


k+l ^k\\2 
0 


iaAt 


Instead, if $\gamma = 0$ and $\delta > \max\{\rho/\sigma^2 - 1, 1\}$, then it holds

I \ . 


4(/9 — a‘^)N5 
L2 


Afc+1 < 


2p^N 


E 

i^j^k 




+ 


2/92 AT . 


1 


E 

i&Al 


|3.fc.+l _ fc.||2 




\ k _ ki+l ||2 

I'^O -^0 II 


i^Ak 




l^k _ „fci+l||2 

1-^0 -^0 II 


I rpk+1 _ II2 

1-^0 '^Oll • 


i€Ai 


Proof: See Appendix C. 


(33) 


(34) 


B. Proof of Theorem 1 

We use the lemmas above to prove Theorem 1. Denote 77 A l + By summing (30) and (33), we 


obtain 


Afc+i < — Afc H— 
rj 7] 

/ 27 + Np 


L'^ + {e- l)p + 2 ^ ^ ^ 

^ 2 ' j=i 


2pm , ^ -a;^||2 


^-1 ikr-4ip+^E 


\ k _ ki+l ||2 

1*^0 -^0 II 


+ 4 ^Eii-A-riiA 




i&A] 
l+P/e , 1 


+ ^)E 


A ^fci+l||2 


(35) 


iG- 4 .A; 

Here, we have used the fact of Yhi^Ak IP = '22^=i IP = ai^ Vi £ By 

taking the telescoping sum of (35), we further obtain 


^k+l < ;^Ao 


1 

H— 

V L 




g=0 ^ i=l 


^ '' 1=0 ' i&Ak-e 


,k-e+i_^k-eu 2 _ { 27 + Np 


(fc-£),+l|,2 ,1 1 


pE? 

/ D—r\ • 


-fc—£+1 y^k—i 


E 

^=0 ' j6A|_ 


£=0 

k-t A^-^)i+^\\2 


J—^1 ^ 

4p2A^ 2^ pt 2^ '' 

^ i=0 ' ieA£_ 


(36a) 


{k-t),+l {k-l)^,,2 

“ II 2 


(36b) 


(36) 


(36c) 
The three terms (36a), (36b), and (36c) in the right hand side (RHS) of (36) can respectively be
bounded as follows, using Lemma 2. Consider the change of variable $k - \ell = j$. Then, we have the
following chain for (36a):

U 

1 


(36a) = E 


£=0 i^Ak-£ 


_ (fc-Q,+i||2 
1-^0 •^0 II 


i=0 i€Aj 

k-1 


Ko -^0 II 


= Sir-1)7] 


j=0 


k 


V 


£=1 ' 


|™i _ v+^l|2 
1-^0 •^0 II 


k-e ^fc-^+l||2 


1*0 - *0 


(37) 


where the inequality is obtained by applying (31) with v = t, Mj = Aj, N = S, and ji = ji which 
satisfies j — t < ji < j (see (24)); to obtain the last equality, the change of variable k — £ = j is applied 
again. 

Analogously, by applying (31) with v = 2t — 1, J\fj = Ap N = N, and ji = ji (which satisfies 
y — 2t +1 < ji < j since ji — T< ji < ji and j — t < ji < j by (24) and (25)), one can bound (36b) as 


P6'>) = E;!? E 


A-£ 


7 ]'- 

e=o ' ieAL 


E^'E 

j=0 iGAj 


J ^L+l||2 


£32iV(^-l)Ey+‘f- 


A(r-l) _ 


V 7] — 1 


= 2N{t — 1)7] 


j=0 

„2(r-l)_^N fc 


\A _ V+L|2 
Ko *0 II 


7] — 1 


Eii 


k-i _ fc-^+l||2 

Xq a,Q II . 


e=i 


The term (36c) can be bounded as follows 

k 


(36c) = 2^^ 2^ II*) -*i 11 


Tjt. 

=0 i€Al_, 
. k 


E^'E 


— xi' |P 


i=0 i&A'] 

k 


TtEE"?' " ^IT'■ l\xr • ^ - x- 


+ + ^ _ ^>||2 


j=o ieA] 


(38) 
j=0 i&A'=

N k 

' i=l j=0 
N k 


<,'-2(r- 


= ?7- 


V 


(39) 


i=l £=0 

where, in the first inequality, we have used the fact of j — r + 1 < ji < j from (25). To show the second 
inequality, notice that for any i G Aj, it also satisfies i G for ^ = y’j + 1,..., j. So, ji = for 
£ = ji + 1, ... ,j- Since j — t < ji < j, each — xfW"^ appears no more than r — 1 times in 

the summation Y.j=o J2ieA- IP- 

By substituting (39), (38) and (37) into (36), we obtain 


1 


Afc+i < ^Ao 


1 

+ - 
V 




^r-i _ 

rj — 1 


+ (r - 1)7? 


,2(r-l) _ I 


r] — 1 


•'-misf 


X 


k-£+l ^k-£\\2 


1 

+ - 
V 


V 2 


L2 


0 


— X 


0 


+ v^-^{T-iy 


N k ^ N 

Ao‘^IV ^ ^ ^ ft£ ^ ^ 

^ J i=l £=0 I i=l 

Let e = 1/p. Therefore, we see that (16) is true if 

7 > (r - 1)77 


\ ,-£+!_ ^k-£\\2^ 


r^5(i+p2)+5/iv^ 

/p-l-l\ 

[v 2 ) 

V ^-1 y 


+ 


,2(r-l) _ I 


Np 


77 — 1 
2L2 L2 


+ 1 ) 


p> (1 + L^) + —+ 
Let P > w A Then (42) holds true if 


2p^N 


l + p"-'(r-l) 


^ 2 . 27.2/ 2 + 2p^-i(r-l) 


p> (1 + L2) +- 1 + 

P \ 


1 + SiVo-^ 


(40) 


(41) 

(42) 

(43) 
Moreover, since $\gamma > 8N(\rho - \sigma^2)$ and $\delta > 1$, we see that $\eta$ has an upper bound




< 2. 


( 44 ) 


Therefore, (14) and (15) are sufficient conditions for (43) and (41), respectively. The proof is thus 
complete. ■ 


C. Proof of Theorem 2 

The key is to build a similar result as Lemma 3 under Assumption 4. Now, consider Assumption 4.
Let $x^*$ be an optimal solution to (1), and let


$y_i^* = A_i x^*$, $i = 1, \ldots, N$.


Then, $(y_1^*, \ldots, y_N^*)$ is unique since the $g_i$'s are strongly convex. So, the optimal solution set to (2) can be
defined as


A* 


(Xq^ Xi^ • • • ? 


y*i 


Aixi 



AnXn 


Xi = xq, i = 


(45) 


Let I 7 V +1 <Xi V*{x) be the projection point of :k = {xq,xJ, ... ,xjj)'^ onto X*, where (g) denotes the 
Kronecker product. It can be shown that the following lemma is true. 


Lemma 4 Under Assumption 4, for any x G it holds that 


N 


N 


N 


fi{'P*{x)) > fi{Xi) + 


2 = 1 


2 = 1 
N o 


2 = 1 


■ 1 2c 
2=1 

2 N 




^ 0 ] 


VE 


i=l 


\Xi - Xo\ 


for some finite constant c > 0 . 


(46) 


Proof: See Appendix D. ■ 

Lemma 4 implies that the structured $f_i$'s in Assumption 4 have a property analogous to that of the strongly
convex functions in (32). Based on Lemma 4, the next lemma shows that one can still bound $\Delta_{k+1}$ as
in Lemma 3 under Assumption 4.
Lemma 5 Suppose that Assumptions 2 and 4 hold, and assume that $\gamma > 8N(\rho - \sigma^2/c) + 4N\sigma^2$ and $\delta$
satisfies


6 > max 



(47) 


Then, (33) holds true. Instead, 1/7 = 0 and 6 > max{(c/o)/f 7 ^ — 1,1}, then 


_II fe+l _ fe||2 

2N[2{p - ayc)6 + ct 2 ] “ 2pm " 




(48) 


iGAl 

Proof: See Appendix E. 


Given Lemma 5, Theorem 2 can be proved by following exactly the same steps as for Theorem 1 in 


Section IV-B. The details are omitted here. 


V. Numerical Results 

In this section, we present some simulation results to examine the practical performance of the AD-ADMM.
We consider the following LR problem

$$\min_{w \in \mathcal{W}} \; \sum_{j=1}^{m} \log\big(1 + \exp(-y_j a_j^T w)\big), \tag{49}$$


where $y_1, \ldots, y_m$ are the binary labels of the $m$ training data, $w \in \mathbb{R}^n$ is the regression variable and
$A = [a_1, \ldots, a_m]^T \in \mathbb{R}^{m \times n}$ is the training data matrix. We used the MiniBooNE particle identification
Data Set*, which contains 130065 training samples ($m = 130065$), and the learning parameter has a size
of 50 ($n = 50$). The constraint set $\mathcal{W}$ is set to $\mathcal{W} = \{w \in \mathbb{R}^n \mid |w_i| \leq 10 \;\forall i = 1, \ldots, n\}$. The
AD-ADMM is implemented on an HP ProLiant BL280c G6 Linux Cluster (Itasca HPC in University of
Minnesota). The $m$ training samples are uniformly distributed to a set of $N$ workers ($N = 10, 15, 20$).
For each worker, we employed the fast iterative shrinkage thresholding algorithm (FISTA) [25] to solve
the corresponding subproblem (10). The stepsize of FISTA is set to 0.0001 and the stopping condition
is that the 2-norm of the gradient is less than 0.001. The penalty parameter $\rho$ of the AD-ADMM is set
to 0.01. Interestingly, while the theoretical convergence conditions in [23, Theorem 1] and Theorem 1
all suggest that the penalty parameter $\gamma$ should be large in the worst case, we find that, for the problem
instance we test here, it is also fine to set $\gamma = 0$.

* https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

Fig. 1: Convergence curves of Algorithm 1 for solving the LR problem (49) on the Itasca computer
cluster; 6 = 0.1, $\rho = 0.01$ and $\gamma = 0$. Panels: (a) $A = 1$; (b) $A = 1$; (c) $\tau = 11$; (d) $\tau = 11$.
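
To illustrate the worker-side computation, the sketch below solves a subproblem of the form $\min_w \sum_j \log(1 + \exp(-y_j a_j^T w)) + w^T \lambda + \frac{\rho}{2}\|w - x_0\|^2$ with an accelerated gradient method. Subproblem (10) is not reproduced in this section, so this form is an assumption based on the worker update described earlier. Because the objective is smooth, FISTA has no proximal step here and reduces to Nesterov-type acceleration. The default step size and tolerance mirror the values quoted in the text; all function names are hypothetical.

```python
import numpy as np

def worker_objective(w, A, y, lam, x0, rho):
    """Local logistic loss plus the linear and quadratic ADMM terms."""
    z = y * (A @ w)
    return np.logaddexp(0.0, -z).sum() + w @ lam + 0.5 * rho * np.sum((w - x0) ** 2)

def worker_gradient(w, A, y, lam, x0, rho):
    """Gradient of worker_objective with respect to w."""
    z = y * (A @ w)
    s = 0.5 * (1.0 - np.tanh(0.5 * z))   # equals 1 / (1 + exp(z)), computed stably
    return -A.T @ (y * s) + lam + rho * (w - x0)

def fista_worker(A, y, lam, x0, rho, step=1e-4, tol=1e-3, max_iter=100000):
    """Accelerated gradient loop; stops when the gradient 2-norm drops below tol.

    Returns (w, converged). The step should not exceed 1 / (||A||_2^2 / 4 + rho).
    """
    w = x0.copy()
    v = x0.copy()          # extrapolated point
    t = 1.0
    for _ in range(max_iter):
        g = worker_gradient(v, A, y, lam, x0, rho)
        if np.linalg.norm(g) < tol:
            return v, True
        w_new = v - step * g
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = w_new + ((t - 1.0) / t_new) * (w_new - w)
        w, t = w_new, t_new
    return w, False
```

The box constraint $\mathcal{W}$ does not appear in this local problem: in the consensus formulation it belongs to $h$ and is therefore handled by the master, for instance by clipping the unconstrained minimizer of the master subproblem to $[-10, 10]^n$.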

Note that the asynchrony in our setting comes naturally from the heterogeneity of the computation
times of computing nodes. In our experiments, analogous to [22], we further constrained the minimum
size of the active set $\mathcal{A}_k$ by $|\mathcal{A}_k| \geq A$ where $A \in [1, N]$ is an integer. When $A = N$, it corresponds to
the synchronous case where the master is forced to wait for all the workers at every iteration.
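
The master's waiting rule combines this minimum active-set size with the delay bound $\tau$. A minimal sketch of that decision, with a hypothetical function name and data layout (a set of arrived worker ids and a dictionary of delay counters $d_i$), could look as follows.

```python
def master_can_update(arrived, delay, tau, min_arrivals):
    """Decide whether the master may proceed with its next update.

    arrived: set of worker ids whose fresh messages have been received.
    delay: dict mapping worker id -> delay counter d_i.
    The master proceeds only if at least min_arrivals workers have arrived
    and no worker with d_i >= tau - 1 is still missing.
    """
    stale_missing = [i for i, d_i in delay.items()
                     if d_i >= tau - 1 and i not in arrived]
    return len(arrived) >= min_arrivals and not stale_missing
```

With `min_arrivals` equal to the number of workers, or with `tau = 1`, the rule only fires when every worker has reported, which recovers the synchronous protocol.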

Figure 1(a) and Figure 1(b) respectively display the convergence curves (objective value) of the AD-ADMM
versus the iteration number and the running time (second), for various values of $N$ and $\tau$. Here we
set $A = 1$. One can observe from Figure 1(a) that, in terms of the iteration number, the convergence speed
of the AD-ADMM slows down when $\tau$ increases. However, as seen from Figure 1(b), the AD-ADMM
is actually faster than its synchronous counterpart ($\tau = 1$), and the running time of the AD-ADMM
can be further reduced with increased $\tau$. We also observe that, when $N$ increases, the advantage of
the AD-ADMM over its synchronous counterpart reduces. This is because the computation load
allocated to each worker decreases with $N$ (as $m$ is fixed), making all the workers experience similar
computation delays.

Fig. 2: The master's computation and waiting times for solving the LR problem (49) over the Itasca
computer cluster. Panels: (a) Computation time, $N = 10$; (b) Waiting time, $N = 10$; (c) Computation time, $N = 20$; (d) Waiting time, $N = 20$.

In Figure 1(c) and Figure 1(d), we present the convergence curves of AD-ADMM with different values
of $A$. We see from Figure 1(c) that when $A$ increases, it always requires fewer iterations to
achieve convergence for all choices of parameters. From Figure 1(d), however, we can observe that a
larger value of $A$ is not always beneficial in reducing the running time. Specifically, one can see that
for $N = 10$, the running time of AD-ADMM decreases when one increases $A$ from 1 to 2, whereas the
running time increases a lot if one increases $A$ to 4. One can observe similar results for $N = 15$ and
$N = 20$.

To look into how the values of $\tau$ and $A$ impact the algorithm speed, in Figure 2, we respectively plot
the computation time and the waiting time of the master node for various pairs of $(\tau, A)$. The setting is the
same as that in Figure 1, except that here the stopping condition of the AD-ADMM is that the objective
value achieves 4.56 x 10^. One can observe from these figures that, when $\tau$ increases, the computing
load of the master also increases but the waiting time is significantly reduced. This explains why in
Figure 1(b) the AD-ADMM requires less running time compared with the synchronous ADMM. On
the other hand, when $A$ increases, the computation time of the master always decreases. This is because
the master may take a smaller number of iterations to reach the target objective value (see Figure 1(c))
and has to spend more time waiting for slow workers. However, the overall waiting time of the master
does not necessarily become larger or smaller with $A$. As seen from Figure 2(b) and Figure 2(d), when
$A$ increases from 1 to 2, the waiting time for $N = 10$ in Figure 2(b) increases, whereas the waiting
time for $N = 20$ in Figure 2(d) decreases. However, for $A = 4$, the waiting times always become larger.
Nevertheless, when comparing to the synchronous ADMM (i.e., $(\tau, A) = (1, N)$), we can see that the
waiting time of the master in the AD-ADMM is always much smaller.

VI. Conclusions 

In this paper, we have analytically studied the linear convergence conditions of the AD-ADMM 
proposed in [23]. Specifically, we have shown that for strongly convex $f_i$'s (Assumption 3) or for $f_i$'s
with the composite form in Assumption 4, the AD-ADMM is guaranteed to converge linearly, provided
that the penalty parameter $\rho$ and the proximal parameter $\gamma$ are chosen sufficiently large depending on the
delay $\tau$. When the delay $\tau$ is bounded and $N$ is large, we have further shown that linear convergence
can be achieved with zero proximal parameter (i.e., $\gamma = 0$), and with a delay-independent $\rho$. The linear
convergence conditions and the linear rate have been given explicitly, which relate the algorithm and
network parameters with the algorithm worst-case convergence performance. The presented numerical
examples have shown that in practice the AD-ADMM can effectively reduce the waiting time of the master
node, and as a consequence improve the overall time efficiency of distributed optimization significantly.
Appendix A

Bound of Consensus Error 


We bound the size of the consensus error in the following lemma.


Lemma 6 Under Assumption 2, it holds that 


N 

E 

2=1 


-^0 II — 


2L 


ZLy V ■\ 




2€-4.fc 


+ 


20 


E 

i&Al 




+^E 

i&Ak 


Ai +1 .v.fc ||2 

^oll 


+^E 


£Cr 


-a;g||2 + 4iV||4+i- 


Xf, 


i&At 


Proof: It follows from (22) and (29) that the following chain is true 


N 

E 

2=1 


||2 

l-^i “^0 II 


Z^ll-^i -^0 “'“■^0 -^0 II 
■r Z^ ll-^i -^0 -^0 *^0 II
ieAl
s^E

|A^^-Af||2 + 1 V 

||Af^+i-Af-| 

|2 

isAfc 



+ 4EI 

|4^+1-4||2 + 4^ II 

^^||2 
•Eq -eoII 

+ 4iV|| 


iSAfc 




_ /Y* ^ I ^ /yt 

■eq “T -eq -Eq 


fc+l||2 


pk+t _ ,y,k\\2 
L-O a-Qll . 


Recall from [23, Eqn. (38)] that

$\nabla f_i(x_i^{k+1}) + \lambda_i^{k+1} = 0$ $\forall i \in \mathcal{V}$ and $\forall k$.

By substituting (A.3) into (A.2) and by the Lipschitz continuity of $\nabla f_i$, we obtain (A.1).


(A.l) 


(A.2) 


(A.3) 
Appendix B
Proof of Lemma 2 


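It is easy to show that the following chain is true:

$$
\begin{aligned}
\sum_{j=0}^{k}\eta^{j}\sum_{i\in\mathcal{A}_j}\|x_0^{\bar j_i+1}-x_0^{j}\|^2
&=\sum_{j=0}^{k}\eta^{j}\sum_{i\in\mathcal{A}_j}\Big\|\sum_{q=\bar j_i+1}^{j-1}(x_0^{q}-x_0^{q+1})\Big\|^2 \\
&\le\sum_{j=0}^{k}\eta^{j}\sum_{i\in\mathcal{A}_j}(j-\bar j_i-1)\sum_{q=\bar j_i+1}^{j-1}\|x_0^{q}-x_0^{q+1}\|^2 \\
&\le\sum_{j=0}^{k}\eta^{j}\sum_{i\in\mathcal{A}_j}(\tau-1)\sum_{q=j-\tau+1}^{j-1}\|x_0^{q}-x_0^{q+1}\|^2 \\
&\le(\tau-1)N\sum_{j=0}^{k}\eta^{j}\sum_{q=j-\tau+1}^{j-1}\|x_0^{q}-x_0^{q+1}\|^2,
\end{aligned}
\tag{A.4}
$$

where the second inequality is owing to $j-\tau\le\bar j_i$. To proceed, we list $\eta^{j}\sum_{q=j-\tau+1}^{j-1}\|x_0^{q}-x_0^{q+1}\|^2$, for $j=1,\ldots,\tau,\ldots$, below:

$$
\begin{aligned}
j=1:&\quad \eta\sum_{q=2-\tau}^{0}\|x_0^{q}-x_0^{q+1}\|^2=\eta\|x_0^{0}-x_0^{1}\|^2, \\
j=2:&\quad \eta^{2}\sum_{q=3-\tau}^{1}\|x_0^{q}-x_0^{q+1}\|^2=\eta^{2}\|x_0^{0}-x_0^{1}\|^2+\eta^{2}\|x_0^{1}-x_0^{2}\|^2, \\
&\quad\vdots \\
j=\tau-1:&\quad \eta^{\tau-1}\sum_{q=0}^{\tau-2}\|x_0^{q}-x_0^{q+1}\|^2=\eta^{\tau-1}\|x_0^{0}-x_0^{1}\|^2+\eta^{\tau-1}\|x_0^{1}-x_0^{2}\|^2+\cdots+\eta^{\tau-1}\|x_0^{\tau-2}-x_0^{\tau-1}\|^2, \\
j=\tau:&\quad \eta^{\tau}\sum_{q=1}^{\tau-1}\|x_0^{q}-x_0^{q+1}\|^2=\eta^{\tau}\|x_0^{1}-x_0^{2}\|^2+\eta^{\tau}\|x_0^{2}-x_0^{3}\|^2+\cdots+\eta^{\tau}\|x_0^{\tau-1}-x_0^{\tau}\|^2.
\end{aligned}
\tag{A.5}
$$

One can verify that each $\|x_0^{q}-x_0^{q+1}\|^2$ appears no more than $\tau-1$ times in the summation term $\sum_{j=0}^{k}\eta^{j}\sum_{q=j-\tau+1}^{j-1}\|x_0^{q}-x_0^{q+1}\|^2$, and therefore the total contribution of each $\|x_0^{q}-x_0^{q+1}\|^2$ is upper bounded by

$$
(\eta^{q+1}+\eta^{q+2}+\cdots+\eta^{q+\tau-1})\|x_0^{q}-x_0^{q+1}\|^2
=\eta^{q+1}\Big(\frac{\eta^{\tau-1}-1}{\eta-1}\Big)\|x_0^{q}-x_0^{q+1}\|^2.
\tag{A.6}
$$

This shows that

$$
\sum_{j=0}^{k}\eta^{j}\sum_{q=j-\tau+1}^{j-1}\|x_0^{q}-x_0^{q+1}\|^2
\le\sum_{j=0}^{k-1}\eta^{j+1}\Big(\frac{\eta^{\tau-1}-1}{\eta-1}\Big)\|x_0^{j}-x_0^{j+1}\|^2.
\tag{A.7}
$$

By substituting (A.7) into (A.4), we obtain (31).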
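The counting argument behind this proof can be checked numerically. The sketch below is illustrative only: it assumes a weight `eta` different from 1, writes the delay bound as `tau`, and uses a made-up list `d` whose entry `d[q]` stands for the squared step $\|x_0^{q}-x_0^{q+1}\|^2$; terms with negative index are dropped. The function name is hypothetical and does not come from the paper.

```python
import random


def delay_weighted_sums(d, eta, tau):
    """Return (lhs, rhs) of the counting bound for a sequence d.

    d[q] stands for ||x_0^q - x_0^{q+1}||^2, q = 0, ..., k-1 (made-up data).
    lhs = sum_{j=0}^{k} eta^j * sum_{q=j-tau+1}^{j-1} d[q], terms with q < 0 dropped.
    rhs = sum_{q=0}^{k-1} eta^(q+1) * (eta^(tau-1) - 1) / (eta - 1) * d[q].
    """
    k = len(d)
    lhs = 0.0
    for j in range(k + 1):
        # Only the last tau - 1 steps before iteration j contribute.
        for q in range(max(j - tau + 1, 0), j):
            lhs += eta ** j * d[q]
    # Geometric sum 1 + eta + ... + eta^(tau-2).
    geom = (eta ** (tau - 1) - 1.0) / (eta - 1.0)
    rhs = sum(eta ** (q + 1) * geom * d[q] for q in range(k))
    return lhs, rhs


rng = random.Random(0)
d = [rng.random() for _ in range(50)]
lhs, rhs = delay_weighted_sums(d, eta=1.05, tau=4)
print(lhs <= rhs)
```

The gap between the two sides comes from the last few steps, which are counted fewer than `tau - 1` times on the left because the outer sum stops at `j = k`.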
Appendix C
Proof of Lemma 3 

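By the optimality condition of (21) [24], one has, $\forall i\in\mathcal{A}_k$ and $\forall x_i\in\mathbb{R}^{n}$,

$$
\begin{aligned}
0 &\ge \big(\nabla f_i(x_i^{k+1})+\lambda_i^{\bar k_i+1}+\rho(x_i^{k+1}-x_0^{\bar k_i+1})\big)^T(x_i^{k+1}-x_i) \\
&= (\nabla f_i(x_i^{k+1}))^T(x_i^{k+1}-x_i)+(\lambda_i^{k+1})^T(x_i^{k+1}-x_i),
\end{aligned}
\tag{A.8}
$$

where the equality is due to (22). Similarly, by the optimality condition of (28) and by (29), one has, $\forall i\in\mathcal{A}_k^c$ and $\forall x_i\in\mathbb{R}^{n}$,

$$
\begin{aligned}
0 &\ge \big(\nabla f_i(x_i^{k+1})+\lambda_i^{\hat k_i+1}+\rho(x_i^{k+1}-x_0^{\hat k_i+1})\big)^T(x_i^{k+1}-x_i) \\
&= (\nabla f_i(x_i^{k+1}))^T(x_i^{k+1}-x_i)+(\lambda_i^{k+1})^T(x_i^{k+1}-x_i).
\end{aligned}
\tag{A.9}
$$

Summing (A.8) and (A.9) for all $i\in\mathcal{V}$ gives rise to

$$
\sum_{i=1}^{N}(\nabla f_i(x_i^{k+1}))^T(x_i^{k+1}-x_i)+\sum_{i=1}^{N}(\lambda_i^{k+1})^T(x_i^{k+1}-x_i)\le 0 \quad \forall (x_1,\ldots,x_N)\in\mathbb{R}^{nN}.
\tag{A.10}
$$

In addition, by the optimality condition of (23) [7, Lemma 4.1], one has, $\forall x_0\in\mathbb{R}^{n}$,

$$
\begin{aligned}
&h(x_0^{k+1})-h(x_0)-\sum_{i=1}^{N}(\lambda_i^{k+1})^T(x_0^{k+1}-x_0)
-\rho\sum_{i=1}^{N}(x_i^{k+1}-x_0^{k+1})^T(x_0^{k+1}-x_0) \\
&\quad+\gamma(x_0^{k+1}-x_0^{k})^T(x_0^{k+1}-x_0)\le 0.
\end{aligned}
\tag{A.11}
$$

Denote $x^\star\in\mathbb{R}^{n}$ as an optimal solution to problem (1). Let $x_1=\cdots=x_N=x_0=x^\star$ in (A.10) and (A.11), and combine the two equations. We obtain

$$
\begin{aligned}
&\sum_{i=1}^{N}(\nabla f_i(x_i^{k+1}))^T(x_i^{k+1}-x^\star)+h(x_0^{k+1})-h(x^\star)
+\sum_{i=1}^{N}(\lambda_i^{k+1})^T(x_i^{k+1}-x_0^{k+1}) \\
&\quad-\rho\sum_{i=1}^{N}(x_i^{k+1}-x_0^{k+1})^T(x_0^{k+1}-x^\star)
+\gamma(x_0^{k+1}-x_0^{k})^T(x_0^{k+1}-x^\star)\le 0.
\end{aligned}
\tag{A.12}
$$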
Let $y = x^\star$ and $x = x_i^{k+1}$ in (32) for all $i \in \mathcal{V}$, and apply them to (A.12). We have

, N N 

0 > ( 5] h{x\+^) + /i(4+') - 5] /i(ai*) - h{x^ 


i=l 
i=1


i=l 

N 
Y {x^+^ - X*)


Note that, by (20),
— ^ 11 ^^ _ ' r -*||2 I ^ _ ' r -^||2

2 ll'^o II “T 2 •^0II • 

By substituting (A.14) and (A.15) into (A.13) and recalling $\mathcal{L}_\rho$ in (12), we obtain


Afc+i < 
2 N
W^k+1 ^k\\2 ^ ^ \\^k+1 ^*||2


- - UEn - ®n - 


We bound the term “ a;*|p as 

N N 
(by (20))
(A.16)
i&Al


where the last inequality is obtained by assuming $\delta > 1$. Besides, we bound the term $\frac{\gamma}{2}\|x_0^{k} - x^\star\|^2$ in the RHS of (A.16) as


^ l+fc _ ^*||2 _ ^ ll^fc _ ^ fc +1 I ^ fc +1 _ ^*||2 

21+0 II ~ 2 "^o “G "^o II 


< 2(1 + , 


„ . 2 ,- . J, 

By substituting (A.17) and (A.18) into (A.16), one obtains 


fe+l||2 


+ J(1 + 
I4+1-®
25


2 (i + ^))ll4^^-* 
2{p — a^)6Lp‘
i&At




+ 4(p - 4)5 ^ \\xq - Xq'^^W^ + 4(p - 4)5 ^ ||4 - xl'^^f. (A.19) 

Let $\delta > 1$ be large enough so that “ ^^(1 + |) E 0 and assume that $\gamma > 8(\rho - \sigma^2)N$. Then, one obtains (33) from (A.19).

To show (34), let $\gamma = 0$ in (A.19) and assume that $\delta > 1$ is large enough so that | — fj^(l + |) < 0.


Appendix D 
Proof of Lemma 4 

Since X* is a linear set, according to the Hoffman bound [26], for some constant c > 0, 

N 

dist^(A’*,®) = ^ W'P^'ix) - Xi\\^ + \\V*{x) - xqW^ 

i=l 

N N 

< \\AiXi - y*|p + \\xi - CKof. (A.20) 
In addition, it follows from the strong convexity of the $g_i$'s that
N N 

^MV*ix)) = Y,9^iA^r*{x)) 
N 9
= “ AiXif.

2 = 1 2=1 2=1 

By substituting (A.20) into (A.21), one obtains (46). 


(A.21) 


Appendix E 
Proof of Lemma 5 


By applying (46) (with $x_i = x_i^{k+1}$ $\forall i = 0, 1, \ldots, N$) to (A.12), and following the same steps as in (A.13)-(A.16), we have


2 / 

Ai+I < ^ ||;„‘+1 _ V'(x)f + l\\xl - P*(, 


X] 
(A.22)


Recall (A.17), (A.18) (with $x^\star$ replaced by $\mathcal{P}^\star(x)$) and (A.1) in Lemma 6 and apply them to (A.22). One obtains
Let $\delta > 1$ be large enough so that ~ < 0. In addition, since $\gamma > 8N(\rho - \sigma^2/c) + 4N\sigma^2$ implies $\gamma > 8N(\rho - \sigma^2/c) + 4N\sigma^2/\delta$, (A.23) implies (33).

To obtain (48), let $\gamma = 0$ in (A.23) and assume that $\delta > 1$ is large enough so that f “ ^(1 + ^) < 0.
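Algorithm 1 Asynchronous Distributed ADMM for (2).

1: Algorithm of the Master:

2: Given initial variable $x^0$ and broadcast it to the workers. Set $k = 0$ and $d_1 = \cdots = d_N = 0$;

3: repeat

4: wait until receiving $\{\hat{x}_i, \hat{\lambda}_i\}_{i \in \mathcal{A}_k}$ from workers $i \in \mathcal{A}_k$ and that $d_i < \tau - 1$ $\forall i \in \mathcal{A}_k^c$.

5: update

$$
x_i^{k+1} = \begin{cases} \hat{x}_i & \forall i \in \mathcal{A}_k \\ x_i^{k} & \forall i \in \mathcal{A}_k^c \end{cases}
\tag{6}
$$

$$
\lambda_i^{k+1} = \begin{cases} \hat{\lambda}_i & \forall i \in \mathcal{A}_k \\ \lambda_i^{k} & \forall i \in \mathcal{A}_k^c \end{cases}
\tag{7}
$$

$$
d_i = \begin{cases} 0 & \forall i \in \mathcal{A}_k \\ d_i + 1 & \forall i \in \mathcal{A}_k^c \end{cases}
\tag{8}
$$

$$
x_0^{k+1} = \arg\min_{x_0 \in \mathbb{R}^n} \Big\{ h(x_0) - x_0^T \sum_{i=1}^{N} \lambda_i^{k+1} + \frac{\rho}{2} \sum_{i=1}^{N} \|x_i^{k+1} - x_0\|^2 + \frac{\gamma}{2} \|x_0 - x_0^{k}\|^2 \Big\}
\tag{9}
$$

6: broadcast $x_0^{k+1}$ to the workers in $\mathcal{A}_k$.

7: set $k \leftarrow k + 1$.

8: until a predefined stopping criterion is satisfied.

1: Algorithm of the ith Worker:

2: Given initial $\lambda^0$ and set $k_i = 0$.

3: repeat

4: wait until receiving $\hat{x}_0$ from the master node.

5: update

$$
x_i^{k_i+1} = \arg\min_{x_i \in \mathbb{R}^n} f_i(x_i) + x_i^T \lambda_i^{k_i} + \frac{\rho}{2} \|x_i - \hat{x}_0\|^2,
\tag{10}
$$

$$
\lambda_i^{k_i+1} = \lambda_i^{k_i} + \rho (x_i^{k_i+1} - \hat{x}_0).
\tag{11}
$$

6: send $(\hat{x}_i, \hat{\lambda}_i) = (x_i^{k_i+1}, \lambda_i^{k_i+1})$ to the master node.

7: set $k_i \leftarrow k_i + 1$.

8: until a predefined stopping criterion is satisfied.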
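The master and worker steps of Algorithm 1 can be simulated sequentially on a toy problem. The sketch below is illustrative only and is not the authors' implementation. It assumes scalar variables, $f_i(x) = \frac{1}{2}(x - a_i)^2$ and $h(x) = \theta |x|$, so that both updates have closed forms. It also assumes a random arrival model in which each worker arrives with probability 0.5, or is forced to arrive once its delay counter reaches $\tau - 1$. The function names and parameter values are hypothetical and are not the values prescribed by the convergence theorems.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |x| evaluated at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ad_admm_toy(a, theta, rho, gamma, tau, iters, seed=0):
    """Sequential simulation of the master/worker updates on a scalar toy problem."""
    rng = random.Random(seed)
    n_workers = len(a)
    x0 = 0.0
    x = [0.0] * n_workers
    lam = [0.0] * n_workers
    x0_hat = [x0] * n_workers  # possibly stale copy of x0 held by each worker
    d = [0] * n_workers        # delay counters kept by the master
    for _ in range(iters):
        # Arrival set: workers at the delay bound must arrive, others arrive at random.
        arrived = [i for i in range(n_workers)
                   if d[i] >= tau - 1 or rng.random() < 0.5]
        if not arrived:
            arrived = [rng.randrange(n_workers)]
        for i in arrived:
            # Worker step (10): closed form for f_i(x) = 0.5 * (x - a_i)^2.
            x[i] = (a[i] - lam[i] + rho * x0_hat[i]) / (1.0 + rho)
            # Worker step (11): dual update against the same stale copy.
            lam[i] += rho * (x[i] - x0_hat[i])
        # Step (8): reset or increment the delay counters.
        for i in range(n_workers):
            d[i] = 0 if i in arrived else d[i] + 1
        # Master step (9): proximal update, closed form for h(x) = theta * |x|.
        c = n_workers * rho + gamma
        v = (sum(lam[i] + rho * x[i] for i in range(n_workers)) + gamma * x0) / c
        x0 = soft_threshold(v, theta / c)
        # Broadcast the new x0 only to the workers that arrived.
        for i in arrived:
            x0_hat[i] = x0
    return x0, x


x0, x = ad_admm_toy([1.0, 2.0, 3.0, 6.0], theta=2.0, rho=4.0,
                    gamma=10.0, tau=3, iters=2000)
print(round(x0, 6))  # 2.5
```

For this toy instance the minimizer of $\sum_i \frac{1}{2}(x - a_i)^2 + \theta |x|$ is the mean of the $a_i$ shrunk by $\theta / N$, that is $3 - 0.5 = 2.5$, which is the value the master variable settles at. Setting `tau=1` forces every worker to arrive at every iteration and recovers a synchronous run.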